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ABSTRACT 

DAP systems have been in day to day use 
for 10 years and a large amount of user 
experience has been gained. The profile 
of user applications is similar to that 
of the MPP working group. 

Experience has shown that contrary to 
expectations, highly parallel systems 
provide excellent performance on so- 
called "dirty problems" such as the 
'physics' part of meteorological codes. 
The reasons for this observation are 
discussed. The arguments against 
replacing bit processors with floating 
point processors are also discussed. 

Keywords: DAP, bit serial processors, 
FORTRAN, balance. 

INTRODUCTION 

The DAP 12] is generally similar in 
concept to MPP but differs in some 
important ways. Probably the most ways 
in which the DAP differs from MPP are: 

a) Memory size : The system view of 
DAP has always been that of a memory 
module with processing capability and 
so emphasis has always been placed 
upon having the largest memory it was 
practical to use when the systems were 
designed. Thus the 64x64 DAP when 
delivered to QMC had 4K bits per PE, 
but this was enlarged to 16K bits per 
PE when the appropriate static RAMs 
became available at reasonable prices. 
User experience has shown that many 
applications are very memory hungry 
and there is little limit to the 
memory which users can exploit. Plans 
for DAP- 3 have a minimum memory of 64K 


bits per PE rising through 256K bits 
per PE to a current maximum of 1Mbit/ 

PE giving a potential total of 128M 
bytes of a 32x32 DAP-3 or }G bytes on 
a 64x64 system. 

b) Interconnection Network : The DAP 
DAP has a 2D NEWS network similar to 
that of the MPP but additionally has 
a second layer network of row and 
column highways. These highways 
provide facilities to rapidly fetch 
and broadcast data. The highways are 
connected to edge registers in the 
master control unit and permit data to 
be selected from any set of processors, 
one per row/column or broadcast to any 
of the processors in the row/column. 

As the same edge registers are used 
for both row and column highways, 
'corner turning' is also supported. 

The highways are not 'buses' but are 
essentially multi-way OR gates and so 
give the DAP the properties of an 
'associative processor' [3J. 

c) Software Systems : An extended 
FORTRAN dialect is the principal 
programming language for DAP use [4], 
This language supports array processing 
in data sets with either N or 

2 

N items on an N*N DAP. Reals with 3, 
4, 5, 6 or 7 and 8 byte precisions are 
fully supported, also available are 
1-8 byte integers and bit variables 
(logicals) . The flexibility and 
efficiency of the DAP-FORTRAN language 
is such that users very rarely need to 
use the assembler language APAL. 

Indeed, the use of APAL is actively 
discouraged and is only justified for 
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specialist bit manipulation algorithms 
arising in such areas as low precision 
image processing and customised-format 
arithmetic routines - floating point 
users never need APAL. 

Above the high level system language 
DAP-FORTRAN there exists a subroutine 
library structured on the NAG line 
containing a wide variety of useful 
routines. The most widely used 
routines include Random Number 
generators, FFT's, tridiagonal equation 
solvers , sorting routines and a number 
of general utilities for data re- 
arrangements. The performance of the 
random number routines is particularly 
spectacular [5], 

USER EXPERIENCE 

The DAP has been used for a wide range 
of applications in the scientific 
field. The range of topics is almost 
identical to that covered at this 
conference, so there is little point in 
detailing it here. The application 
area most represented is Monte Carlo 
Simulations in such areas as QCD, Ising 
models, molecular dynamics etc., much 
to the surprise of many so-called 
experts who felt that SIMD architectures 
were unsuitable for Monte Carlo work. 

In addition to scientific (floating 
point) work, the DAP has been used for 
a number of non-numeric problems 
including: 

- searching for large primes 

- textual data abstraction 

- information retrieval, and 

- graphics - in particular, molecular 
graphics. 

The bit serial nature of the processing 
elements has helped greatly in 
providing the flexibility to support 
more than just specialist number 
crunching. 


"DIRTY PROBLEMS ARE NICE1" 

The reaction of users to the DAP has 
been very favourable especially to 
reprogramming exercise. The 
expressive power of DAP-FORTRAN has 
allowed great reductions in program 
size and complexity, and much higher 
performance than expected for so-called 
"dirty problems", i.e. problems 
requiring many data dependent decisions. 

There is a natural tendency to assume 
that problems with many decisions are 
not optimum for the DAP/MPP type of 
computers as the natural way to 
implement them is by parallel 
conditional expressions of the type 

where (temperature<freezing point) 
Heat:=heat + latent_heat 
In DAP-FORTRAN this would be coded as 
HEAT (TEMPERATURE.LT. FREEZING POINT) 

= HEAT + LATENT_HEAT 

the array indexing type is called 

MASK-INDEXING. 

As only some processors will update 
their local values of the variables, 
the natural tendency is to assume that 
such code will demonstrate a lower 
speed-up over 'conventional' computers 
than, say, unconditional code. Our 
experience is that the reverse is the 
case and tasks of the above nature 
demonstrate better speed-up than the 
clean unconditional code! 

The somewhat paradoxical result is 
explained when one realises that one 
is comparing the performance of two 
machines rather than measuring some 
absolute computing performance, and 
what one is really observing is that 
arrays of bit-serial processors are 
better suited to "dirty code" than 
are 'conventional' word based 
architectures. The explanation is 
simple but difficult for many people to 
grasp. Bit-serial systems are 
primarily optimal for dealing with 
arrays of logical operations, as these 
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are essential single bit operations. 
Arithmetic is software produced out of 
lots of logical primitives, hence the 
worst performance of the system comes 
from pure floating point code! The 
greater the proportion of a problem 
that is purely logical, the greater 
the potential is for an array of bit- 
serial processors to outshine 
architectures based on floating point 
pipelines - which try to avoid 
conditionals . 

SYSTEM BALANCE 

Some users of DAP have argued that the 
system would be improved if the bit- 
serial processors were replaced by 
floating point units and it is not 
surprising to me to find a number of 
the MPP group making the same 
observation. Care should be taken in 
evaluating what is really being stated. 
There are three arguments which can be 
made against such changes. 

1) 'Dirty problems' will really 
become dirty as the arguments in the 
previous section become invalid. 

2) The parallelism in systems of equal 
cost becomes much less. The cost 

2 

(in say mm of silicon) of a 64 bit 
floating point processor is equivalent 
to the cost of more than 128 single bit 
PE's, hence one should realise that the 
option one is really comparing is that 
of say, 16K, 1 bit processors and 128 - 
64 bit processors. 

3) The inter-processor communication 
rate should be somehow in "balance" 
with the processor computational 
capabilities. I have attempted to put 
some flesh on this concept of balance 
[6], which I summarise here. 

For many parallel algorithms - 
especially those with results producing 
global properties of arrays, the 
computation is carried out in a loop 
structure of the form: 


K: = l } 

WHILE K<N begin 

move some data a "distance" K; 
perform a computation; 

K: =K*2 
END WHILE 

typical of such loops are - SUM, 
Product, Maximum, Minimum, FFT, 

Sorting etc. 

The loop is traversed log^N times 
so the cost is: 

(log^N) cost of a computational step + 

cost of shifts 'distance' 1+2+4+8.... 

On a mesh connected processor with edge 
size n and dimensionality d and 

N=n^ (for MPP n = 128, d = 2) , the 
cost of such shifts become d(n-l)* 
cost of moving a data item from one 
processor to its neighbout. 

The Balance Factor can be defined by: 

time for computational component 

B = 

time for data movement component 

if B = 1 the two costs are equal, i.e. 
the system is in balance. If B>>1, 
the computational time is dominating 
and we would gain in performance if 
more powerful processors were used. 

If B«1 , data movement costs dominate 
and little benefit would appear if the 
processors were speeded up without 
improving the data transmission speed. 

G = 1/B could be called the Granuality 
factor of the system as it gives an 
idea of the minimum recommended number 
of computational operations that must 
be performed per data transmission 
step. 

Computation of balance factors for 
systems such as MPP and DAP suggests 
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that for floating point operations the 
systems are in approximate balance and 
no benefit would be gained by using 
floating point PE's unless the 
interprocessor network were improved by 
multibit highways and/or higher 
dimensionality meshes. 

The bit serial processors approach seems 
to be an optimal route to providing 
very highly parallel systems which can 
support a large range of application 
areas . 

Although one could try and produce the 
computer with the worlds largest MFLOP 
rating by building arrays of 

2*^ floating point chips, there is 
little evidence that such machines 
would produce even 0.1% of their 
theoretical performance on anything 
other than highly artificial problems 
such as the Mandelbrot Set. 
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